The Gini Index is a measure of node purity (James et al. 2021) and can also be used to measure the importance of each predictor. It is defined by the following formula, where \(K\) is the number of classes and \({\hat{p}_{mk}}\) is the proportion of observations in the \(m\)th region that are from the \(k\)th class. A Gini Index of 0 represents perfect purity: every observation in the region belongs to a single class.
\[G=\sum_{k=1}^{K} {\hat{p}_{mk}}(1-\hat{p}_{mk})\]
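As an illustration of the formula, the following minimal sketch computes the Gini Index of a single node from its class labels. It is written in Python with the standard library only, which is an assumption made for the example (the analysis itself was run in R); the function name `gini_index` and the labels are hypothetical.

```python
from collections import Counter


def gini_index(labels):
    """Gini Index of one node: sum over classes of p_k * (1 - p_k)."""
    n = len(labels)
    if n == 0:
        raise ValueError("node must contain at least one observation")
    # p_k: proportion of observations in this node that belong to class k
    proportions = [count / n for count in Counter(labels).values()]
    return sum(p * (1 - p) for p in proportions)


# Hypothetical labels, not taken from the dataset
pure_node = ["died"] * 6
mixed_node = ["died"] * 3 + ["survived"] * 3

print(gini_index(pure_node))   # 0.0 (perfect purity)
print(gini_index(mixed_node))  # 0.5 (two evenly mixed classes)
```

With two classes the index peaks at 0.5 for an even split and falls to 0 as one class dominates, which is why lower values indicate purer nodes.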
Bagging (bootstrap aggregation) combines the predictions of decision trees that are each fitted to a separate bootstrapped training set. It is defined by the following formula, where \(B\) is the number of bootstrapped training sets and \(\hat{f}^{*b}\) is the prediction model fitted to the \(b\)th of them. Although bagging improves prediction accuracy, it makes the results harder to interpret, because they cannot be visualized as easily as a single decision tree (James et al. 2021).
\[\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B}\hat{f}^{*b}(x)\]
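The averaging in the formula can be sketched as follows. This is a toy example in Python (standard library only), not the model used in this analysis: the base learner `fit_stump`, the helper `bagged_predictor`, and the data are all hypothetical, and a one-split stump stands in for a full decision tree.

```python
import random
from statistics import mean, median


def fit_stump(sample):
    """Hypothetical base learner: one split at the median of x,
    predicting the mean of y on each side of the split."""
    threshold = median(x for x, _ in sample)
    left = [y for x, y in sample if x <= threshold]
    right = [y for x, y in sample if x > threshold]
    overall = mean(y for _, y in sample)
    left_pred = mean(left) if left else overall
    right_pred = mean(right) if right else overall
    return lambda x: left_pred if x <= threshold else right_pred


def bagged_predictor(data, B, seed=0):
    """Fit one model per bootstrapped training set and average them."""
    rng = random.Random(seed)
    models = []
    for _ in range(B):
        # Bootstrap: draw len(data) observations with replacement
        boot = [rng.choice(data) for _ in data]
        models.append(fit_stump(boot))
    # f_bag(x) = (1 / B) * sum of the B individual predictions
    return lambda x: sum(model(x) for model in models) / B


# Made-up (x, y) pairs, for illustration only
data = [(x, 2.0 * x) for x in range(10)]
f_bag = bagged_predictor(data, B=100)
print(f_bag(2), f_bag(8))
```

The sketch averages numeric predictions, as in the formula. For a classification problem, the usual analogue is to record the class predicted by each of the \(B\) trees and take a majority vote.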
Dataset from Faisalabad Institute of Cardiology
299 patient records
13 features per record
Random Forest used to classify patients for heart failure
RStudio Pro 2023.12.0, Build 369.pro3
Various R libraries
RStudio Server running on a RHEL 9-based virtual machine within a VMware vSphere HA cluster. The VM has 50 vCPUs and 196 GB RAM assigned.
Hardware includes Dell PowerEdge R750 servers with dual Xeon Gold 6338N (32-core) CPUs, 512 GB RAM, and SFP28 25 Gbit networking for all communications.
Cluster storage is from an NVMe-based Dell SAN.
Various metrics are used to assess the Random Forest model performance: